[Data] Store _source_paths in object store rather than self to prevent excessive spilling during read task serialization#59999
Conversation
There was a problem hiding this comment.
Code Review
This pull request reverts a change that added _source_paths to FileBasedDatasource and ParquetDatasource for lineage tracking. As you've noted in the description, this change introduced a significant performance regression due to the increased serialization size of read tasks, leading to excessive object spilling. The revert correctly removes the problematic _source_paths attribute and the corresponding test assertions. This is a necessary fix for the performance issue, and I approve of this change. Given that this re-introduces the issue with lineage tracking, it would be beneficial to create a follow-up ticket to explore a more performant solution for tracking source paths.
| @@ -328,8 +326,7 @@ def __init__( | |||
| self._local_scheduling = NodeAffinitySchedulingStrategy( | |||
| ray.get_runtime_context().get_node_id(), soft=False | |||
| ) | |||
There was a problem hiding this comment.
@lee1258561 We can't remove this attribute cause there are other dependencies that need it.
What I recommend instead is to call ray.put() on the paths and then have a property based function called _source_paths. That way only the ObjectRef is serialized and not all the paths.
You can look how we do this for _paths in file_based_datasource
Link:
ray/python/ray/data/datasource/file_based_datasource.py
Lines 230 to 231 in a5d3e3c
There was a problem hiding this comment.
I changed the PR based on your suggestion. However, I did not see any other place in this repo referencing _source_paths except the three places that is added in this PR #55978.
Please help me double check if any other places I need to be aware of. Thanks!
|
This pull request has been automatically marked as stale because it has not had You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed. |
15478ac to
02f68f9
Compare
)" This reverts commit 460bce4. Signed-off-by: Raymond Lee <lee1258561@gmail.com>
…t ref in self Signed-off-by: Raymond Lee <lee1258561@gmail.com>
02f68f9 to
9581646
Compare
|
done signing commit |
goutamvenkat-anyscale
left a comment
There was a problem hiding this comment.
lgtm! Thanks for the change
…t excessive spilling during read task serialization (ray-project#59999) ## Description This PR reintroducing the issue that A previous PR trying to solve: ray-project#55978 Specifically, this add a full list of paths to `self` and for Datasource self is captured every read_task_fn during serialization and causing this data being duplicated and cause excessive spilling. We face the similar warning and have spilling behavior: ``` The serialized size of your read function named 'read_task_fn' is 9.7MB. This size relatively large. As a result, Ray might excessively spill objects during execution. To fix this issue, avoid accessing `self` or other large objects in 'read_task_fn'. ``` Revert can solve the issue but this attribute seems to have other dependancy. This PR move the _source_paths to object store during init time and only keep the object ref in self. and introduce a property function to materialize raw path data when needed ## Related issues ray-project#55978 ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Raymond Lee <lee1258561@gmail.com> Signed-off-by: lee1258561 <lee1258561@gmail.com> Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Description
This PR reintroducing the issue that A previous PR trying to solve: #55978
Specifically, this add a full list of paths to
selfand for Datasource self is captured every read_task_fn during serialization and causing this data being duplicated and cause excessive spilling.We face the similar warning and have spilling behavior:
Revert can solve the issue but this attribute seems to have other dependancy.
This PR move the _source_paths to object store during init time and only keep the object ref in self. and introduce a property function to materialize raw path data when needed
Related issues
#55978
Additional information